Tokenize is free for personal use only. You are encouraged to redistribute it, but this document must be included. Cheap licenses are available for site or commercial distribution. Contact me at one of the addresses below for details (electronic contact preferred). THE AUTHOR PROVIDES NO WARRANTIES FOR THIS SOFTWARE. USE AT YOUR OWN RISK!
Tokenize was designed to make it easier to split text into elements based on a set of delimiters. The demo AppleScript included with the distribution contains many examples of Tokenize's usage, including several novel uses that may not be obvious at first glance.
INSTALLATION
______________________
To install: drag the Tokenize file to the Scripting Additions folder inside the Extensions folder.
BACKGROUND INFORMATION
______________________
Because of the way the tokenization is implemented, Tokenize can also be used as a quick way of removing unwanted characters from a text string. To better understand what is possible with Tokenize, here's a brief description of how it works. The text to be tokenized is scanned for each of the strings given in the delimiter list, and all occurrences of these strings are replaced by a special character (essentially a null character). After all delimiters are processed, a final pass gathers all the strings between the special characters into a list. Understanding this algorithm will help you figure out how text will ultimately be parsed when using Tokenize.
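The character-removal trick mentioned above falls out of this algorithm: tokenize on the unwanted character, then concatenate the resulting list, and the character is gone. Here's a rough sketch of that idea (it relies on standard AppleScript list-to-string coercion, with the text item delimiters set to an empty string):

set pieces to tokenize "408-555-1212" with delimiters "-"
-- pieces is {"408", "555", "1212"}
set AppleScript's text item delimiters to ""
set cleaned to pieces as string
-- cleaned is "4085551212"; the hyphens have been stripped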
For example, consider a string of text containing words separated by tab characters, with anywhere from one to three tabs between each pair of words. Here's a string set up as described:
set testString to "One\tGiant\t\tStep\tFor\t\t\tMankind"
If I tokenize this string using tab as the only delimiter, it returns this list:
tokenize testString with delimiters tab
=> {"One", "Giant", "Step", "For", "Mankind"}
If, on the other hand, I tokenize using a string of three tabs, the output is different:
tokenize testString with delimiters tab & tab & tab
=> {"One Giant Step For", "Mankind"}
The output from this version is a list of two strings. Since tokenize found only one place in testString where three tab characters sat side by side, it split the string there. Tokenizing with a two-tab string would produce yet another result, as sketched below.
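Tracing the algorithm described above by hand (this is the expected behavior, assuming delimiter matches are found from left to right; it is not one of the examples in the demo script), a two-tab delimiter splits at the double tab after "Giant" and at the first two tabs of the triple-tab run, so the leftover single tabs remain inside the items:

tokenize testString with delimiters tab & tab
=> [expected: {"One\tGiant", "Step\tFor", "\tMankind"}, with the stray single tabs still inside the items]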
The direct parameter to tokenize is a string, and the second (required) parameter is a list of strings (each one or more bytes in length) to use in tokenizing the direct parameter.
If you are tokenizing with only one delimiter, you need not pass it as a list, since AppleScript will handle the coercion for you. For example, the following is legal:
tokenize "My Name Is" with delimiters " Name "
=> {"My","Is"}
Some text-processing tasks require more than one call to Tokenize. For example, if the variable myText contained a number of lines separated by return characters and you wanted to retrieve the words from line five, you could write the following AppleScript commands:
tokenize myText with delimiters {return}
tokenize (item 5 of result) with delimiters {space}
=> [result is a list with all the words from line five of the text]
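As a concrete illustration of those two calls (the sample text here is made up, not taken from the demo script):

set myText to "First line" & return & "Second line" & return & "Third line" & return & "Fourth line" & return & "The quick brown fox"
tokenize myText with delimiters {return}
-- result is {"First line", "Second line", "Third line", "Fourth line", "The quick brown fox"}
tokenize (item 5 of result) with delimiters {space}
=> {"The", "quick", "brown", "fox"}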
______________________
Comments, bug reports, and suggestions are welcome. Source code is available for $40, and I'll include the sources to any other OSAXen I may have written up to that point. If you have any ideas for useful Scripting Additions which haven't been written yet, send me a message describing your idea.